library(tidyverse)
library(janitor)
library(lubridate)
library(GGally)
library(ggfortify)
library(modelr)
library(tidytext)
library(wordcloud)
library(leaflet)
library(ggridges)
library(ggthemes)
library(relaimpo)
library(tm)
library(SnowballC)
library(RColorBrewer)
The data covers New York City Airbnb listings in 2019.
It includes prices, New York neighbourhoods and room types, among other variables. Geospatial coordinates are also present, offering valuable insight into Airbnb locations.
Primary aims: - Analyse the determinants of Airbnb prices - Offer recommendations for customers booking Airbnbs - Provide insights so hosts can adopt an appropriate pricing strategy
prices <- read_csv("raw_data/AB_NYC_2019.csv")
Rows: 48895 Columns: 16
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (5): name, host_name, neighbourhood_group, neighbourhood, room_type
dbl (10): id, host_id, latitude, longitude, price, minimum_nights, number_of_reviews, reviews_per_month, calculate...
date (1): last_review
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
prices %>%
summarise(mean_reviews_per_month = mean(reviews_per_month, na.rm = TRUE))
The data contains information on host names and unique IDs. To avoid any ethical issues I chose to remove these variables.
# check for missing values
prices %>%
summarise(
across(
.cols = everything(),
.fns = ~sum(is.na(.x))
)
) %>%
dplyr::select(c(name, host_name, last_review, reviews_per_month))
last_review and reviews_per_month contain many missing values (10,052 rows each).
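As a quick sanity check on that count (a minimal sketch, assuming the prices tibble read in above):

```r
# Count missing values in the two review columns directly;
# each should return 10052 for this 2019 NYC snapshot
sum(is.na(prices$reviews_per_month))
sum(is.na(prices$last_review))
```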
prices_df <- prices %>%
# drop host names and host_id (ethical)
dplyr::select(-c(host_name, host_id, id)) %>%
# take out month from last_review
mutate(last_review_month = month(last_review, label = TRUE),
# take out year from last_review
last_review_year = year(last_review),
# take name length
name_length = str_length(name)) %>%
# drop NA values
drop_na() %>%
dplyr::select(-c(last_review, last_review_year))
# number of bookings per borough
prices_df %>%
group_by(neighbourhood_group) %>%
summarise(num_bookings = n()) %>%
arrange(desc(num_bookings)) %>%
ggplot(aes(reorder(neighbourhood_group, num_bookings), num_bookings,
fill = neighbourhood_group)) +
geom_col(show.legend = FALSE) +
geom_label(mapping = aes(label = num_bookings), size = 3,
fill = "#F5FFFA", fontface = "bold", hjust = 0.5) +
theme_classic() +
labs(x = "Borough", y = "Number of Bookings",
title = "Total Number of Bookings per Borough")
Manhattan and Brooklyn are by far the most popular areas for Airbnb listings.
Both are central boroughs, which may indicate that the main reason people book is holidays / tourism.
# Average Price per Room
price_per_room <- prices_df %>%
group_by(room_type) %>%
summarise(avg_price = mean(price)) %>%
ggplot(aes(room_type, avg_price, fill = room_type)) +
geom_col(show.legend = FALSE) +
geom_label(mapping = aes(label = round(avg_price, 2)), size = 6,
fill = "#F5FFFA", fontface = "bold",
position = position_stack(vjust = 0.9)) +
theme_classic() +
labs(x = "Room Type", y = "Average Price ($)", title = "Average Price by Room Type")
# Average price per neighbourhood group
price_room_borough <- prices_df %>%
group_by(neighbourhood_group, room_type) %>%
summarise(avg_price = mean(price)) %>%
arrange(desc(avg_price)) %>%
ggplot(aes(reorder(neighbourhood_group, avg_price), avg_price,
fill = room_type)) +
geom_col() +
geom_label(mapping = aes(label = round(avg_price, 2)), size = 2.5,
fill = "#F5FFFA", fontface = "bold", hjust = 0.5,
position = position_stack(vjust = 0.9)) +
theme_classic() +
labs(x = "New York Boroughs", y = "Average Price ($)",
title = "Average Price per Borough by Room Type") +
facet_wrap(~room_type) +
coord_flip()
cowplot::plot_grid(price_per_room, price_room_borough, nrow = 2)
# top 10 most expensive districts on average
a <- prices_df %>%
group_by(neighbourhood) %>%
summarise(avg_price = mean(price)) %>%
arrange(desc(avg_price)) %>%
slice(1:10) %>%
ggplot(aes(reorder(neighbourhood, avg_price), avg_price, fill = neighbourhood)) +
geom_col(show.legend = FALSE) +
geom_label(mapping = aes(label = round(avg_price, 2)), size = 3,
fill = "#F5FFFA", fontface = "bold") +
# coord_flip() +
theme_classic() +
theme(axis.text.x = element_text(angle = 30, vjust = 0.95, hjust = 1)) +
labs(x = "Neighbourhood", y = "Average Price ($)",
title = "The 10 Most Expensive Districts on Average")
# top 10 least expensive districts on average
b <- prices_df %>%
group_by(neighbourhood) %>%
summarise(avg_price = mean(price)) %>%
arrange(avg_price) %>%
slice(1:10) %>%
ggplot(aes(reorder(neighbourhood, avg_price), avg_price, fill = neighbourhood)) +
geom_col(show.legend = FALSE) +
geom_label(mapping = aes(label = round(avg_price, 2)), size = 3,
fill = "#F5FFFA", fontface = "bold") +
# coord_flip() +
theme_classic() +
theme(axis.text.x = element_text(angle = 30, vjust = 0.95, hjust = 1)) +
labs(x = "Neighbourhood", y = "Average Price ($)",
title = "The 10 Least Expensive Districts on Average")
cowplot::plot_grid(a, b, nrow=2)
The 10 most expensive districts are all located in Manhattan, apart from Neponsit (Queens) and Willowbrook (Staten Island).
The majority of the 10 least expensive districts are in the Bronx, Staten Island and Queens.
# Average Reviews per Month by Last Month review was left
prices_df %>%
mutate(last_review_month = as_factor(last_review_month)) %>%
group_by(last_review_month) %>%
summarise(n = mean(reviews_per_month, na.rm = FALSE)) %>%
ggplot(aes(last_review_month, n, fill = last_review_month)) +
geom_col(show.legend = FALSE) +
theme_bw() +
labs(x = "Last Review Month", y = "Average Reviews per Month",
title = "Average Reviews by Last Month Review Submitted")
Suggests that most people are leaving reviews in the summer, indicating some seasonality to Airbnb booking in New York.
ggplot(subset(prices_df, price < 500), aes(x = price)) +
geom_density(
mapping = aes(fill = neighbourhood_group),
alpha = 1, linewidth = 0.5, show.legend = FALSE) +
theme_bw() +
# scale_fill_economist() +
facet_wrap(~neighbourhood_group) +
labs(x = "Price", y = "Density", title = "Price Density by Borough")
The price density plot reveals that boroughs with fewer bookings (Queens, Staten Island and the Bronx) have a higher density of lower prices.
The most popular areas (Manhattan and Brooklyn) have wider density plots, indicating that prices vary more.
I want to find the words that are associated with different price ranges.
To do this, I need to create a new variable that classifies the price range of each Airbnb.
The price ranges will be defined around the average price across all bookings.
prices_df %>%
summarise(mean_price = mean(price))
The mean price is $142. Therefore, low will be less than $100, medium between $100 and $200, high between $200 and $300, and very high greater than $300.
prices_new <- prices_df %>%
mutate(price_class = case_when(
# case_when evaluates conditions in order, so each band
# implicitly excludes the bands above it
price < 100 ~ "Low",
price < 200 ~ "Medium",
price < 300 ~ "High",
TRUE ~ "Very High"
))
# words associated with high prices
high_price_words <- prices_new %>%
# filter for High and Very High price classes
filter(price_class %in% c("High", "Very High")) %>%
# take individual words from the name column
unnest_tokens(word, name) %>%
# take out stop_words
anti_join(stop_words) %>%
# select word column
dplyr::select(word)
sorted_hp_words <- high_price_words %>%
count(word, sort = TRUE)
wordcloud(words = sorted_hp_words$word, freq = sorted_hp_words$n, min.freq = 1,
max.words=60, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Spectral"))
Words that stand out in the wordcloud include: bedroom, apartment, village, luxury, location, Manhattan, spacious, park. These give a first sense of what kind of Airbnbs are being advertised at high prices.
# words associated with low prices
low_price_words <- prices_new %>%
filter(price_class %in% "Low") %>%
unnest_tokens(word, name) %>%
anti_join(stop_words) %>%
dplyr::select(word)
sorted_lp_words <- low_price_words %>%
count(word, sort = TRUE)
wordcloud(words = sorted_lp_words$word, freq = sorted_lp_words$n, min.freq = 1, max.words=60,
random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Spectral"))
Looking at the wordcloud for the low price range, we see some similarities with the high price range, indicating owners are trying to sell the property as upmarket.
There is a big emphasis on the property being “private”, which is likely a major concern for people who are not paying much. In contrast, privacy is a given when paying for high-end accommodation.
# words associated with medium price range (around the average)
medium_price_words <- prices_new %>%
filter(price_class %in% "Medium") %>%
unnest_tokens(word, name) %>%
anti_join(stop_words) %>%
dplyr::select(word)
sorted_mp_words <- medium_price_words %>%
count(word, sort = TRUE)
wordcloud(words = sorted_mp_words$word, freq = sorted_mp_words$n, min.freq = 1, max.words=60,
random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Spectral"))
There is not a tremendous amount of difference here. As might be expected, the medium range strikes a balance between the low and high price ranges, highlighting privacy as important but placing more emphasis on location.
Analysis of the most popular word combinations
# Obtain the 2 word bigrams
bigram_names <- prices_new %>%
unnest_tokens(bigram, name, token = "ngrams", n = 2)
# Separate into two words
bigrams_separated <- bigram_names %>%
separate(bigram, c("word1", "word2"), sep = " ")
# filter the words for stop words
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# count and sort the most popular words
bigrams_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
# most popular bigrams graph
bigrams_counts %>%
# filter out numbers
filter(!str_detect(word1, "[0-9]")) %>%
# unite both word columns
unite(col = "bigrams", c("word1", "word2"), sep = " ", remove = TRUE) %>%
# get top 10
slice(1:10) %>%
# visualise
ggplot(aes(reorder(bigrams, n), n, fill = bigrams)) +
geom_col(show.legend = FALSE) +
coord_flip() +
theme_bw() +
geom_label(mapping = aes(label = n), size = 3,
fill = "#F5FFFA", fontface = "bold") +
labs(y = "Number of Bigram Mentions", x = "Bigram",
title = "Most Popular Bigrams in Airbnb Name")
Hosts' emphasis on location is a common strategy to entice consumers:
consumers want to be closer to famous landmarks.
Can we see whether the geospatial data backs up the wordcloud and bigram analysis, i.e. the emphasis on location for the medium and low price ranges?
# Density of New York Boroughs
prices_new %>%
ggplot(aes(longitude, latitude)) +
geom_density2d(aes(colour = neighbourhood_group)) +
theme_bw() +
labs(title = "Density of Boroughs")
# Density of Price Class
prices_new %>%
ggplot(aes(longitude, latitude)) +
geom_density2d(aes(colour = price_class)) +
theme_bw() +
labs(title = "Density of Price Class")
#cowplot::plot_grid(d, c, nrow=1)
We can see the spread of prices by area: lower prices tend to be located on the perimeter of New York, indicating that location towards the centre of Manhattan is a large determinant of price.
pal <- colorFactor(palette = c("orange", "white", "yellow", "red"),
# price_class lives in prices_new, not prices_df
domain = prices_new$price_class)
leaflet(data = prices_new) %>%
addProviderTiles(providers$CartoDB.DarkMatterNoLabels) %>%
addCircleMarkers(~longitude, ~latitude, color = ~pal(price_class), weight = 1,
radius=1, fillOpacity = 0.1, opacity = 0.1) %>%
addLegend("bottomright", pal = pal, values = ~price_class,
title = "Price Class",
opacity = 1
)
The leaflet map shows the contrast between the extremities and the city centre; prices are evidently much higher in the centre.
Summary
Prices are being impacted by: - location - type of accommodation
Approach: - build the model manually - a good way to become familiar with the data - avoid over- or under-fitting the model
Goals of the model:
Dependent variable: Price
It is important to investigate its distribution, as it may require a transformation to create a better fit for the model.
# distribution of price
prices_df %>%
ggplot(aes(price)) +
geom_histogram(bins = 30, aes(y = after_stat(density)), fill = "red") +
geom_density(alpha = 0.2, fill = "red") +
theme_bw() +
labs(title = "Distribution of Price")
Price is heavily skewed to the right, which indicates a log transformation is needed to obtain a more normal, bell-shaped distribution.
# mean price
mean_price <- prices_new %>%
summarise(m_price = mean(price))
# log bell shaped distribution
prices_df %>%
ggplot(aes(price)) +
geom_histogram(bins = 30, aes(y = after_stat(density)), fill = "red") +
geom_density(alpha = 0.2, fill = "red") +
geom_vline(data = mean_price, aes(xintercept = m_price), linewidth = 1,
linetype = 2) +
theme_bw() +
scale_x_log10() +
labs(title = "Distribution of log(Price)")
Dropped variables:
Dummy variables created from text analysis:
prices_reg_df <- prices_new %>%
# create dummy variables for chosen words that may impact price
mutate(apartment_ad = if_else(str_detect(name, "[Aa]partment"), "YES", "NO"),
private_ad = if_else(str_detect(name, "[Pp]rivate"), "YES", "NO"),
central_park_ad = if_else(str_detect(name, "[Cc]entral [Pp]ark"),
"YES", "NO"),
) %>%
dplyr::select(-c(name, neighbourhood, price_class)) %>%
# log transform price
mutate(log_price = log(price + 1)) %>%
# bring price to the front
dplyr::select(log_price,price, everything()) %>%
# change all character variables to factor
mutate(across(.cols = where(is.character),
.fns = as_factor)) %>%
dplyr::select(-price)
Use the alias() function to check for multicollinearity:
alias(lm(log_price ~ ., data = prices_reg_df))
Model :
log_price ~ neighbourhood_group + latitude + longitude + room_type +
minimum_nights + number_of_reviews + reviews_per_month +
calculated_host_listings_count + availability_365 + last_review_month +
name_length + apartment_ad + private_ad + central_park_ad
Data is now ready to be used in regression analysis
Use ggpairs() to investigate which variables are highly correlated with price.
The dataset is large, so the relationship analysis will be divided into numeric and non-numeric data types.
variable_numeric <- prices_reg_df %>%
select_if(is.numeric)
ggpairs(variable_numeric)
variable_nonnumeric <- prices_reg_df %>%
select_if(function(x) !is.numeric(x))
variable_nonnumeric$log_price <- prices_reg_df$log_price
ggpairs(variable_nonnumeric)
Largest correlation with price: longitude, with a correlation of -0.155, statistically significant at the 0.001 level.
The non-numeric ggpairs suggests that several variables will be able to explain price variance.
prices_reg_df %>%
ggplot(aes(longitude, log_price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, colour = "red") +
theme_bw()
Create the first model:
model_1 <- lm(log_price ~ longitude, data = prices_reg_df)
autoplot(model_1)
graph 1 (population) - most observations appear independent, but the blue line declines at the end, indicating a chunk of observations is not independently distributed. Need another variable.
graph 2 (distribution) - shows a fairly normal distribution; again, a bend at the end indicates the model needs more to give a better distribution.
graph 3 (homoskedasticity) - there is heteroskedasticity in the model, indicated by the curve in the blue line.
graph 4 (outliers) - there is a small number of outliers, and points are not highly leveraged.
summary(model_1)
Call:
lm(formula = log_price ~ longitude, data = prices_reg_df)
Residuals:
Min 1Q Median 3Q Max
-4.7215 -0.4287 -0.0309 0.3813 4.6629
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -335.83166 5.02723 -66.80 <2e-16 ***
longitude -4.60491 0.06798 -67.74 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6256 on 38835 degrees of freedom
Multiple R-squared: 0.1057, Adjusted R-squared: 0.1056
F-statistic: 4589 on 1 and 38835 DF, p-value: < 2.2e-16
R-squared: longitude explains 10.6% of the variance in Airbnb price.
From the ggpairs plot and previous analysis, neighbourhood_group appears to have an impact on Airbnb prices, so I will add it next.
prices_reg_df %>%
ggplot(aes(neighbourhood_group, log_price, fill = neighbourhood_group)) +
geom_boxplot(show.legend = FALSE) +
theme_bw() +
labs(title = "Relationship between Price and Borough")
model_2 <- lm(log_price ~ longitude + neighbourhood_group, data = prices_reg_df)
autoplot(model_2)
graph 1 - independent population (achieved). graph 2 - still a skew in the distribution. graph 3 - homoskedasticity (achieved). graph 4 - no highly leveraged points, but there are still (potentially) a few outliers.
summary(model_2)
Call:
lm(formula = log_price ~ longitude + neighbourhood_group, data = prices_reg_df)
Residuals:
Min 1Q Median 3Q Max
-4.5931 -0.4275 -0.0334 0.3609 4.6367
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.283e+02 7.499e+00 -43.780 < 2e-16 ***
longitude -4.501e+00 1.014e-01 -44.391 < 2e-16 ***
neighbourhood_groupManhattan 2.790e-01 7.033e-03 39.667 < 2e-16 ***
neighbourhood_groupQueens 1.506e-01 1.294e-02 11.643 < 2e-16 ***
neighbourhood_groupStaten Island -9.425e-01 3.777e-02 -24.956 < 2e-16 ***
neighbourhood_groupBronx -6.477e-02 2.201e-02 -2.942 0.00326 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6038 on 38831 degrees of freedom
Multiple R-squared: 0.167, Adjusted R-squared: 0.1669
F-statistic: 1557 on 5 and 38831 DF, p-value: < 2.2e-16
coefficients:
all statistically significant at the 1% level of significance - we can reject the null hypothesis and say that the coefficients are statistically different from zero.
Moving the Manhattan dummy from zero to one (i.e. a listing in Manhattan rather than the baseline borough, Brooklyn) is associated with a change in price of (e^0.279 - 1) * 100 ≈ 32%, holding all other factors constant.
Moving the Staten Island dummy from zero to one is associated with a change in price of (e^-0.943 - 1) * 100 ≈ -61%, holding all other factors constant.
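The back-transformation can be computed directly; a minimal sketch using the dummy coefficients reported in the model_2 summary:

```r
# Convert a log-linear dummy coefficient into a percentage effect on price
percent_effect <- function(beta) (exp(beta) - 1) * 100

percent_effect(0.2790)   # Manhattan: roughly +32%
percent_effect(-0.9425)  # Staten Island: roughly -61%
```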
anova(model_1, model_2)
Analysis of Variance Table
Model 1: log_price ~ longitude
Model 2: log_price ~ longitude + neighbourhood_group
Res.Df RSS Df Sum of Sq F Pr(>F)
1 38835 15198
2 38831 14156 4 1042.4 714.85 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The anova() output confirms that the neighbourhood_group dummy variable was worth including in the model.
Before adding a new variable, we must check the residuals of this model against the remaining variables in the dataset.
residuals <- prices_reg_df %>%
add_residuals(model_2) %>%
dplyr::select(-c(log_price, longitude, neighbourhood_group))
# separate into numeric and non-numeric
price_resid_numeric <- residuals %>%
select_if(is.numeric)
price_resid_nonnumeric <- residuals %>%
select_if(function(x) !is.numeric(x))
price_resid_nonnumeric$resid <- residuals$resid
ggpairs(price_resid_numeric)
ggpairs(price_resid_nonnumeric)
Among the numeric variables, availability_365 has the highest correlation with the residuals: a positive correlation of 0.131 (statistically significant).
However, the non-numeric variables indicate that room type may present a better explanation of price variance.
avail_plot <- price_resid_numeric %>%
ggplot(aes(availability_365, resid)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, colour = "red") +
theme_bw() +
labs(x = "Yearly Room Availability", y = "Residuals",
title = "Relationship between Residuals and Availability") +
theme(plot.title = element_text(size=10))
room_type_plot <- price_resid_nonnumeric %>%
ggplot(aes(room_type, resid, fill = room_type)) +
geom_boxplot(show.legend = FALSE) +
theme_bw() +
labs(title = "Relationship between Residuals and Room Type",
x = "Room Type", y = "Residuals") +
theme(plot.title = element_text(size=10))
cowplot::plot_grid(avail_plot, room_type_plot, nrow = 1)
The comparison suggests room type should offer more explanation.
model_3 <- lm(log_price ~ longitude + neighbourhood_group + room_type,
data = prices_reg_df)
autoplot(model_3)
graph 1 - residuals are independent. graph 2 - the residuals seem increasingly not distributed around zero, with more skew at the beginning and end. graph 3 - the conditional variance of the residuals is constant (homoskedasticity).
summary(model_3)
Call:
lm(formula = log_price ~ longitude + neighbourhood_group + room_type,
data = prices_reg_df)
Residuals:
Min 1Q Median 3Q Max
-4.9468 -0.2982 -0.0401 0.2313 4.9777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.133e+02 5.899e+00 -36.160 < 2e-16 ***
longitude -2.942e+00 7.978e-02 -36.873 < 2e-16 ***
neighbourhood_groupManhattan 2.397e-01 5.495e-03 43.611 < 2e-16 ***
neighbourhood_groupQueens 1.183e-01 1.010e-02 11.709 < 2e-16 ***
neighbourhood_groupStaten Island -6.894e-01 2.951e-02 -23.358 < 2e-16 ***
neighbourhood_groupBronx -5.006e-02 1.718e-02 -2.914 0.00357 **
room_typeEntire home/apt 7.443e-01 4.943e-03 150.572 < 2e-16 ***
room_typeShared room -3.956e-01 1.659e-02 -23.841 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4711 on 38829 degrees of freedom
Multiple R-squared: 0.493, Adjusted R-squared: 0.4929
F-statistic: 5393 on 7 and 38829 DF, p-value: < 2.2e-16
As anticipated, room_type greatly enhanced the explanation of the model. The R-squared suggests the model explains 49.3% of the variance in price.
The variable coefficients are all statistically significant therefore can be interpreted.
residuals <- prices_reg_df %>%
add_residuals(model_3) %>%
dplyr::select(-c(log_price, longitude, neighbourhood_group, room_type))
# separate into numeric and non-numeric
price_resid_numeric <- residuals %>%
select_if(is.numeric)
price_resid_nonnumeric <- residuals %>%
select_if(function(x) !is.numeric(x))
price_resid_nonnumeric$resid <- residuals$resid
ggpairs(price_resid_numeric)
ggpairs(price_resid_nonnumeric)
As before, availability displays by far the strongest correlation, and there do not seem to be any standout non-numeric variables. Therefore, room availability will be used in the next model.
model_4 <- lm(log_price ~ longitude + neighbourhood_group + room_type + availability_365,
data = prices_reg_df)
autoplot(model_4)
graph 1 - the model population is independently distributed. graph 2 - still skew at both ends, indicating the residuals are only somewhat normally distributed. graph 3 - the conditional variance of the residuals is constant (homoskedastic).
summary(model_4)
Call:
lm(formula = log_price ~ longitude + neighbourhood_group + room_type +
availability_365, data = prices_reg_df)
Residuals:
Min 1Q Median 3Q Max
-4.9243 -0.2919 -0.0406 0.2328 5.0684
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.274e+02 5.815e+00 -39.107 < 2e-16 ***
longitude -3.131e+00 7.863e-02 -39.819 < 2e-16 ***
neighbourhood_groupManhattan 2.332e-01 5.408e-03 43.122 < 2e-16 ***
neighbourhood_groupQueens 1.041e-01 9.943e-03 10.466 < 2e-16 ***
neighbourhood_groupStaten Island -7.852e-01 2.915e-02 -26.939 < 2e-16 ***
neighbourhood_groupBronx -7.984e-02 1.692e-02 -4.719 2.38e-06 ***
room_typeEntire home/apt 7.438e-01 4.861e-03 153.004 < 2e-16 ***
room_typeShared room -4.276e-01 1.634e-02 -26.166 < 2e-16 ***
availability_365 6.674e-04 1.840e-05 36.271 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4633 on 38828 degrees of freedom
Multiple R-squared: 0.5096, Adjusted R-squared: 0.5095
F-statistic: 5043 on 8 and 38828 DF, p-value: < 2.2e-16
R-squared has only increased marginally to 0.51 - model explains 51% of the variance in price
All variables in the model are statistically significant
Potential interaction terms:
Through a process of elimination, longitude:neighbourhood_group was chosen.
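The elimination process is not shown in full; one way it could be sketched, assuming candidate interactions are scored on adjusted R-squared (the candidate list below is illustrative, not the one actually used):

```r
# Hypothetical sketch: compare candidate interaction terms by adjusted R-squared
candidates <- c("longitude:neighbourhood_group",
                "longitude:room_type",
                "neighbourhood_group:room_type")

sapply(candidates, function(term) {
  f <- as.formula(paste(
    "log_price ~ longitude + neighbourhood_group + room_type +",
    "availability_365 +", term))
  summary(lm(f, data = prices_reg_df))$adj.r.squared
})
```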
model_5 <- lm(log_price ~ longitude + neighbourhood_group + room_type +
availability_365 + longitude:neighbourhood_group,
data = prices_reg_df)
price_resid <- prices_reg_df %>%
add_residuals(model_5) %>%
dplyr::select(-log_price)
coplot(resid ~ longitude | neighbourhood_group,
panel = function(x, y, ...){
points(x, y)
abline(lm(y ~ x), col = "blue")
},
data = price_resid, rows = 1)
autoplot(model_5)
graph 1 - the population is independent. graph 2 - the model is still not completely evenly distributed around 0. graph 3 - the conditional variance of the residuals is constant (homoskedasticity). graph 4 - there are still some outliers, but no highly leveraged points.
summary(model_5)
Call:
lm(formula = log_price ~ longitude + neighbourhood_group + room_type +
availability_365 + longitude:neighbourhood_group, data = prices_reg_df)
Residuals:
Min 1Q Median 3Q Max
-4.9155 -0.2856 -0.0436 0.2300 5.1577
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.378e+02 1.031e+01 -23.065 < 2e-16 ***
longitude -3.272e+00 1.394e-01 -23.468 < 2e-16 ***
neighbourhood_groupManhattan -3.219e+02 1.557e+01 -20.677 < 2e-16 ***
neighbourhood_groupQueens 1.761e+02 1.349e+01 13.053 < 2e-16 ***
neighbourhood_groupStaten Island 3.119e+02 5.582e+01 5.587 2.33e-08 ***
neighbourhood_groupBronx 2.388e+02 3.601e+01 6.631 3.38e-11 ***
room_typeEntire home/apt 7.255e-01 4.820e-03 150.506 < 2e-16 ***
room_typeShared room -4.263e-01 1.609e-02 -26.491 < 2e-16 ***
availability_365 6.323e-04 1.815e-05 34.842 < 2e-16 ***
longitude:neighbourhood_groupManhattan -4.354e+00 2.105e-01 -20.689 < 2e-16 ***
longitude:neighbourhood_groupQueens 2.382e+00 1.825e-01 13.053 < 2e-16 ***
longitude:neighbourhood_groupStaten Island 4.219e+00 7.533e-01 5.601 2.15e-08 ***
longitude:neighbourhood_groupBronx 3.233e+00 4.874e-01 6.633 3.32e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4562 on 38824 degrees of freedom
Multiple R-squared: 0.5246, Adjusted R-squared: 0.5244
F-statistic: 3570 on 12 and 38824 DF, p-value: < 2.2e-16
Model Summary:
All explanatory variables are statistically significant
R-squared: model explains 52.5% of variance in price
Room_type had the greatest influence on the model
calc.relimp(model_5, type = "lmg", rela = TRUE)
Response variable: log_price
Total response variance: 0.4375758
Analysis based on 38837 observations
12 Regressors:
Some regressors combined in groups:
Group neighbourhood_group : neighbourhood_groupManhattan neighbourhood_groupQueens neighbourhood_groupStaten Island neighbourhood_groupBronx
Group room_type : room_typeEntire home/apt room_typeShared room
Group longitude:neighbourhood_group : longitude:neighbourhood_groupManhattan longitude:neighbourhood_groupQueens longitude:neighbourhood_groupStaten Island longitude:neighbourhood_groupBronx
Relative importance of 5 (groups of) regressors assessed:
neighbourhood_group room_type longitude:neighbourhood_group longitude availability_365
Proportion of variance explained by model: 52.46%
Metrics are normalized to sum to 100% (rela=TRUE).
Relative importance metrics:
lmg
neighbourhood_group 0.15650767
room_type 0.66297003
longitude:neighbourhood_group 0.03891208
longitude 0.11782702
availability_365 0.02378320
Average coefficients for different model sizes:
1group 2groups 3groups 4groups 5groups
longitude -4.6049149739 -4.1229356833 -4.051358e+00 -3.897941e+00 -3.271661e+00
neighbourhood_groupManhattan 0.3819453426 0.3211757943 -1.070997e+02 -2.511174e+02 -3.218665e+02
neighbourhood_groupQueens -0.2087349654 -0.0655638368 8.477795e+01 1.718999e+02 1.760652e+02
neighbourhood_groupStaten Island -0.2501789720 -0.4947127336 9.315515e+01 2.287431e+02 3.118768e+02
neighbourhood_groupBronx -0.3667919672 -0.2372578195 1.284316e+02 2.514425e+02 2.387853e+02
room_typeEntire home/apt 0.8196414599 0.7858184226 7.602355e-01 7.344142e-01 7.254613e-01
room_typeShared room -0.3741098014 -0.3871140671 -4.059261e-01 -4.118401e-01 -4.263369e-01
availability_365 0.0003947914 0.0005442186 6.232164e-04 6.210777e-04 6.323326e-04
longitude:neighbourhood_groupManhattan NaN NaN -5.805751e+00 -5.096923e+00 -4.354219e+00
longitude:neighbourhood_groupQueens NaN NaN 4.586125e+00 3.487592e+00 2.381855e+00
longitude:neighbourhood_groupStaten Island NaN NaN 5.069942e+00 4.647986e+00 4.219434e+00
longitude:neighbourhood_groupBronx NaN NaN 6.959079e+00 5.105896e+00 3.232856e+00
Variables of relative importance:
Room type, perhaps unsurprisingly, is the most relevant variable for explaining variation in price. The least explanatory is availability_365.
For Consumers:
For hosts: - how you describe your property does not translate into being able to charge higher prices - you will be able to charge a much higher price for a property listed as a whole apartment - property availability does not have much impact on price
More data on:
- yearly data - time-series analysis - identify cyclical trends